REPORT  DOCUMENTATION  PAGE 

Form  Approved 

OMB  No.  0704-0188 

The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data
sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other
aspect of this collection of information, including suggestions for reducing the burden, to Department of Defense, Washington Headquarters Services, Directorate for Information
Operations and Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other
provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number.

PLEASE  DO  NOT  RETURN  YOUR  FORM  TO  THE  ABOVE  ADDRESS. 

1.  REPORT  DATE  (DD-MM-YYYY)  2.  REPORT  TYPE 

12/23/2013  Final  Progress  Report 

3.  DATES  COVERED  (From  -  To) 

01/01/10-09/30/13 

4.  TITLE  AND  SUBTITLE 

Smart  Distributed  Sensor  Fields 

5a.  CONTRACT  NUMBER 

5b.  GRANT  NUMBER 

N00014-10-1-0477

5c.  PROGRAM  ELEMENT  NUMBER 

6.  AUTHOR(S) 

Saligrama,  Venkatesh 


5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Trustees  of  Boston  University 

881  Commonwealth  Avenue 

Boston,  MA  02215 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

9.  SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

Office  of  Naval  Research 

495  Summer  Street 

Suite  627 

Boston,  MA  02210-2109 

10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  Public  Release;  Distribution  is  Unlimited 


13.  SUPPLEMENTARY  NOTES 




14.  ABSTRACT 

Video cameras are critical to providing persistent surveillance capabilities for situational awareness. Currently, video
analysis requires significant human supervision. Even many of the routine tasks, ranging from detecting, identifying, and
localizing/tracking interesting events, to discarding irrelevant data, to providing actionable intelligence, currently require
significant human supervision. Human supervision is not scalable for providing persistent wide-area monitoring, particularly
for monitoring a network of cameras that would generally be employed for theater-level operations. We develop
methods for autonomous suspicious activity detection, multi-camera fusion, and retrieval algorithms for large-scale WAS data.


15.  SUBJECT  TERMS 

Irregular  and  asymmetric  warfare,  WAS  data,  anomaly  detection,  search  and  retrieval,  low  bandwidth  capability,  low  storage 
ability. 


16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 

18.  NUMBER 

19a.  NAME  OF  RESPONSIBLE  PERSON 

a.  REPORT 

b.  ABSTRACT 

c.  THIS  PAGE 

ABSTRACT 

OF 

PAGES 

Venkatesh  Saligrama 

U 

U

U

UU

23 

19b.  TELEPHONE  NUMBER  (Include  area  code) 

617-353-1040 

Standard  Form  298  (Rev.  8/98) 
Prescribed  by  ANSI  Std.  Z39.18 


Final  Report 


Project  No:  N00014-10-1-0477 
Thrust:  Asymmetric  and  Irregular  Warfare 

Smart  Distributed  Sensor  Fields:  Algorithms  for  Tactical  Sensors 


Submitted  by 


Principal  Investigator:  Venkatesh  Saligrama,  Professor 

Department  of  Electrical  and  Computer  Engineering 
Boston  University 

8  St.  Mary’s  Street,  Boston,  MA  02215 
Phone: (617) 353-1040, Fax: (617) 353-6440
E-mail:  srv@bu.edu 


Submitted  to 


Technical  Point  of  Contact:  Dr.  Martin  Kruger,  Program  Officer 

Office  of  Naval  Research 
ONR  Department  Code:  30 
875  North  Randolph  Street 
Arlington,  VA  22203-1995 


Duration:  Jan  1,  2010  -  Sept  2013 


Contents 


1  Introduction
1.1  Scope
1.1.1  Goals, Objectives & Challenges
1.2  Operational Naval Concept
2  Deliverables & Outcomes
3  Detailed Technical Approach & Results
3.1  Anomaly Detection
3.1.1  Video Locality Model and Feature Descriptors
3.1.2  Algorithm for Video Anomaly Detection
3.1.3  Experimental Results
3.2  Multi-Camera Processing
3.3  Search & Retrieval
3.3.1  Challenges
3.3.2  Search Algorithm
3.3.3  Results




1  Introduction


1.1  Scope 

Video cameras are critical to providing persistent surveillance capabilities for situational awareness.
Currently, video analysis requires significant human supervision. Even many of the routine
tasks, ranging from detecting, identifying, and localizing/tracking interesting events, to discarding
irrelevant data, to providing actionable intelligence, currently require significant human supervision.
Human supervision is not scalable for providing persistent wide-area monitoring,
particularly for monitoring a network of cameras that would generally be employed for theater-level
operations. The scope of this project is to develop new concepts for autonomous video
analysis for highly cluttered urban environments.

1.1.1  Goals,  Objectives  &  Challenges 

We present techniques for autonomous and distributed operation of wide-area camera networks
for video analysis in unstructured and highly cluttered urban environments. Our goals for video
analysis are:

1.  Anomaly Detection: Here we are interested in developing novel algorithms for
autonomous real-time detection of suspicious activity.

2.  Search & Retrieval: This task deals with pulling activities of interest from large wide-area
surveillance video based on an analyst-generated input query.

3.  Multi-Camera Activity Fusion: In this task we present algorithms for combining
views from multiple cameras for dynamic scene characterization to improve detection,
localization and tracking performance.

The main challenges in developing concepts for video analysis are as follows. First, urban scenarios
provide a deluge of dynamic data. Identifying relevant information, such as meaningful change
detection, in urban clutter is not easy. Second, doing so reliably, i.e., with few false alarms
and missed detections, is difficult and possibly impossible in harsh sensing environments (camera
jitter, etc.). Third, combining views from multiple cameras for dynamic scene characterization
and to improve detection, localization and tracking performance is hard. Finally, search and
retrieval of activities of interest, that is, indexing events of interest in long videos, is extremely challenging,
not least because of the inherent multi-scale nature of the phenomena.

1.2  Operational  Naval  Concept 

Our main objective is to develop an essentially autonomous camera surveillance network in
unstructured and highly cluttered urban & littoral environments. The new capabilities will include
real-time abnormal activity detection, localization and tracking from multi-camera systems, and
content summarization for fast indexing and search of archived video data. The operational
performance improvements will include computationally efficient algorithms, demonstration in
highly cluttered environments, and improved ROC curves to improve reliability and robustness.
The proposed effort deals with several of the Navy's S&T focus areas: in addition to Automated
Image Analysis, these include Asymmetric and Irregular Warfare, ISR, and Information Integration & Fusion.


2  Deliverables & Outcomes


The principal deliverable for this 6.1 project is this final report. Apart from this deliverable,
the project has resulted in several notable outcomes.

Transition Path  The transition path for this project is towards building an Agile Tactical
SNET. The search & retrieval concepts developed under this project have generated significant
interest. To this end, this project has led to a new 6.2 project for developing Distributed Search
Engines for Large Video Stores. The search algorithm finds matches in a hash table and
retrieves only the video segments that meet the search criteria. This is useful for remote video stores that
generate significant data when there is insufficient bandwidth to transmit this data in a timely
manner. We have partnered with NRL-Stennis to help transition our concepts onto their
CisView platform. Our transition plan in the new 6.2 project includes code development with
updates targeted for delivery to ONR each year.

Papers & Reports  This project has produced a number of publications in internationally
recognized conferences and reputed journals. We have also presented this material at a number
of universities, and parts of this work have appeared as chapters in edited books. We list these
publications here:

1.  G. Castanon, P.-M. Jodoin, V. Saligrama, A. Caron, Activity Retrieval in Large Surveillance
Videos, Elsevier E-reference for Signal Processing, 2013

2.  V. Saligrama, Z. Chen, Video Anomaly Detection Based on Local Statistical Aggregates,
IEEE Computer Vision and Pattern Recognition (CVPR), 2012

3.  G. Castanon, V. Saligrama, P. M. Jodoin, A. Caron, Exploratory Search in Long Surveillance
Videos, ACM Multimedia, 2012 (full paper, acceptance rate: 20

4.  Y. Benezeth, P. Jodoin, V. Saligrama, Abnormality Detection Using Low-Level Co-occurring
Events, Pattern Recognition Letters, 2011

5.  P. M. Jodoin, V. Saligrama, J. Konrad, Behavior Subtraction: A New Tool for Video Analytics,
IEEE Transactions on Image Processing, Sept 2012

6.  V. Saligrama, J. Konrad, P. M. Jodoin, Video Anomaly Identification, IEEE Signal Proc.
Magazine, 2011

7.  E. Ermis, P. Clarot, P. M. Jodoin, V. Saligrama, Activity Based Matching in Distributed
Camera Networks, IEEE Transactions on Image Processing, Sept 2010

8.  E. Ermis, V. Saligrama, P. Jodoin, Information Fusion and Anomaly Detection with Uncalibrated
Cameras in Surveillance, in Multimedia Information Extraction, M. Maybury
(ed.), IEEE Press, 2012

9.  Y. Benezeth, P. Jodoin, V. Saligrama, Modeling Patterns of Activity and Detecting Abnormal
Events with Low Level Co-occurrence of Activity, in Distributed Video Sensor
Networks, Bir Bhanu et al. (eds), Springer, 2011

3  Detailed Technical Approach & Results

Video surveillance has been an area of significant interest in both academia and industry. We
develop a novel event-based framework for anomaly detection, multi-camera fusion and activity
retrieval. Our approach is based on statistical learning techniques for video analysis. At a
fundamental level this requires three steps, namely,

1.  Feature Selection & Extraction: The main goal here is to select descriptors that are not
only informative but also have sufficiently low complexity such that they are robust,
relatively easy to extract, and amenable to real-time analysis. For instance, tracks are
high-dimensional features that are difficult to extract in cluttered scenarios. Our goal is
to select informative low-dimensional features that are robust to photometric properties
and relatively easy to extract.

2.  Feature Modeling: The goal here is to develop probabilistic models to characterize the
dynamic evolution of features over space and time.

3.  Video  analysis:  This  involves  algorithms  for  anomaly  detection,  multi-camera  fusion  and 
retrieval. 

3.1  Anomaly  Detection 

Anomaly detection for video surveillance has gained importance [2,3,6,11,15,25,29,32,33,
35,37,57]. Our focus is on problems where we are given a set of nominal training video
samples. Based on these samples we need to determine whether or not a test video contains
an anomaly. We consider anomalies in motion attributes. Such outliers can include (un)usual
motion patterns of (un)usual objects in (un)usual locations. These encompass anomalies such
as dropped baggage, illegal U-turns, and sudden movements.

We focus on anomalies that have local spatio-temporal signatures. The work reported
here has appeared in our CVPR 2012 paper [43]. By locality we mean that the spatio-temporal
region surrounding the anomalous region appears to follow the nominal activity and carries little
information about the anomaly itself. For instance, the appearance of a bicyclist, shown in
Fig. 1, illustrates spatio-temporal locality: outside a small window in time or in space,
the optical flow magnitudes look remarkably similar to nominal activity. We also consider other
cases where locality is only temporal. These include cases such as sudden crowd movement [1]
or illegal U-turns [6]. We exploit these ideas by proposing a statistical non-parametric notion
of locality and deriving data-driven rules for anomaly detection with predictable performance
and statistical guarantees. Our approach is related to a number of other non-parametric
data-driven approaches such as [46, 63], with important differences. Existing statistical approaches
do not account for local anomalies, i.e., anomalies that are localized to a small time interval
and/or spatial region. Our statistical locality notion leads to an elegant characterization of
anomaly detection and suggests novel empirical rules. A fundamental insight gained from our
theoretical results is that the optimal decision rules for local anomalies are local, irrespective of
the global statistical dependencies exhibited in the nominal behavior. This key insight implies
that the inherently large ambient data dimension is inconsequential. Our local empirical rules
fuse local statistics and produce a composite score for a video segment. Anomalies are declared
by ranking composite scores for video segments. Our anomaly detection algorithm is described
in Fig. 2. Our setup extracts local low-level motion descriptors and resembles other common
approaches. Adam et al. [2] use histograms of optical flows at specific "local monitors" to derive
decision rules for anomaly detection at those locations. Itti and Baldi consider low-level feature
descriptors at every location [27] and use Poisson statistics for modeling nominal activity.

Figure 1: Illustration of local anomaly. Top: Frame of a video segment [35] with an anomaly
(bicycle). Bottom panel (left): Optical flow magnitude averaged over the red block vs. frame number
for nominal and anomalous video segments. (Right): Optical flow magnitude averaged over different
blocks along horizontal pixel blocks for different nominal and anomalous video segments.

We propose a joint probability distribution of the low-level motion descriptors under nominal
as well as anomalous distributions. Such joint distributions have also been considered
extensively. Kim et al. [32] also extract local optical flow and enforce consistency across
locations through Markov Random Field models. Benezeth et al. [6] use binary background
subtraction to extract motion labels and then model these local features using a 3D Markov
Random Field (MRF). Kratz et al. [33] extract spatio-temporal gradients to fit a Gaussian model,
and then use an HMM to detect abnormal events. Mahadevan et al. [35] model normal crowd
behavior by mixtures of dynamic textures.

We introduce novel structural assumptions on the joint distributions to account for spatial
and temporal locality of anomalies. Our locality assumption leads us to consider statistics
on local 3D brick patches (space-time blocks) across different overlapping locations. These
statistics are obtained through spatio-temporal filters as shown in Fig. 2. Our 3D modeling
superficially resembles that of Boiman and Irani [11] but is different. They consider ensembles of
3D bricks and derive Gaussian models for matching test ensembles at a specific location with
corresponding ensembles in a database. However, our goal is statistical and does not attempt to
match 3D bricks at a location. Rather (see Fig. 7), we first compute a location-specific K-nearest
neighbor (K-NN) distance for each 3D brick. We then normalize and compute a composite score
by aggregating weighted K-NN distances from all the locations. This composite score is ranked
against other such composite scores associated with training video segments. We then declare
low scores as anomalies. It turns out that fusing local 3D brick statistics in this manner has
theoretical significance. The empirical composite scoring and ranking scheme asymptotically
converges to the optimal decision rule for maximizing detection power subject to false alarm
constraints.

Our work is also related to Cong et al. [15], who consider dictionary learning methods.
There, 3D patches with a specific temporal and spatial scale are chosen to match each scenario. A
dictionary of representative patterns is learnt from training video. Anomalies are declared
if the test sample cannot be represented using a sparse set of dictionary patterns. It is worth
mentioning that we could incorporate their ideas into our scheme. Sparse decomposition for
each spatio-temporal scale can be viewed as a feature vector that feeds into our local KNN
block (see Fig. 7).

Figure 2: Overview of Anomaly Detection Algorithm. Motion descriptors are first extracted and
quantized into small blocks. Spatio-temporal filters at different scales are applied to obtain smooth
estimates at each spatio-temporal location for each feature descriptor. The local KNN distance for each
location is computed for training and test video. These local KNN distances are aggregated to produce
a composite score for the test and training video. The composite scores are ranked to determine
anomalies.

Other work on video anomaly detection includes social force models by Mehran et al. [37],
normalized cut clustering by Zhong et al. [64], and trajectory-based methods [3,25,57].

3.1.1  Video  Locality  Model  and  Feature  Descriptors 

A video snippet $x$ is typically a short segment of video. Training data can consist of several
snippets, $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$. For theoretical purposes we assume that the different snippets are
independent of each other. These snippets can be obtained by partitioning a longer video into
short non-overlapping segments.

For a video snippet $x$, we associate a graph $G = (V \times T, E)$. The set $V$ is associated
with spatial locations and the set $T$ is associated with temporal locations in the video snippet.
Each location $v \in V$ and time $t \in T$ is associated with a feature descriptor $x_{v,t}$. While it
is theoretically possible to consider all pixel locations and temporal instants, we quantize into
10 x 10 x 5 non-overlapping blocks. We call these blocks atoms, and we associate with each atom
the average values of its features. Two atoms are connected if they are either temporal or spatial
neighbors. The rest of the development with regard to the Mask and Markov assumptions follows as
in the previous section (also see Fig. 4).
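The quantization into atoms can be sketched as follows. This is a minimal illustration, not the report's implementation: it assumes the 10 x 10 x 5 block means 10 pixels by 10 pixels by 5 frames, that the per-pixel features are stored in a NumPy array of shape (H, W, T, C), that partial blocks at the borders are dropped, and that "spatial neighbors" means 4-connectivity. The function names are hypothetical.

```python
import numpy as np

def quantize_into_atoms(features, block=(10, 10, 5)):
    """Average per-pixel features over non-overlapping space-time blocks.

    features: array of shape (H, W, T, C) with C feature channels per pixel.
    Returns an array of shape (H // bh, W // bw, T // bt, C), one row per atom.
    """
    bh, bw, bt = block
    H, W, T, C = features.shape
    nh, nw, nt = H // bh, W // bw, T // bt
    # Assumption: rows, columns and frames that do not fill a block are dropped.
    trimmed = features[: nh * bh, : nw * bw, : nt * bt]
    # Split every axis into (block index, offset inside block), then average offsets.
    blocks = trimmed.reshape(nh, bh, nw, bw, nt, bt, C)
    return blocks.mean(axis=(1, 3, 5))

def atom_neighbors(index, shape):
    """Atoms connected to `index` in the graph: spatial or temporal neighbors."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    neighbors = []
    for step in steps:
        candidate = tuple(i + d for i, d in zip(index, step))
        if all(0 <= c < s for c, s in zip(candidate, shape)):
            neighbors.append(candidate)
    return neighbors
```

For a 20 x 30 pixel, 10-frame clip this yields a 2 x 3 x 2 grid of atoms, and a corner atom has three neighbors.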

Feature  Descriptors:  We  now  describe  local  features  that  are  associated  with  each  node 
(atom)  of  our  graph.  During  feature  extraction  we  compute  a  feature  value  for  each  pixel. 
Then,  the  pixel-level  features  are  condensed  into  a  multi-dimensional  vector  for  each  atom  by 
averaging  each  feature  component  over  all  the  pixels  within  the  atom.  We  use  the  following 
local  features: 




(1) Persistence: Activity is detected using a basic background subtraction method (as for
instance in [6]). The initial background is estimated using the median of several hundred frames.
Then, the background is updated using the running average method. We flag each pixel as
part of the background or foreground. Persistence, for an atom, is the percentage of foreground
pixels in the atom.

(2) Direction: Motion vectors are extracted using Horn and Schunck's optical flow method [9].
Motion is quantized into 8 directions and an extra "idle" bin is used for flow vectors with low
magnitude. The feature for each atom is a 9-bin un-normalized motion histogram. The value
for each bin corresponds to the number of pixels moving in the direction associated with the
bin.

(3)  Motion  Magnitude :  Magnitude  of  motion  vectors  for  each  bin  (except  the  idle  bin)  is 
computed  and  averaged  over  all  the  pixels  in  the  atom. 

We thus have an 11-dimensional descriptor for each atom. While our setup is sufficiently
general and admits other descriptors, we use only these components.
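A rough sketch of how the three features above could be assembled for one atom is given below. It is illustrative only: the threshold values, the update rate `alpha`, and all function names are hypothetical, the optical flow is taken as given rather than computed with Horn and Schunck's method, and the motion magnitude is reduced to a single scalar per atom, an assumption made here so that the components add up to 1 + 9 + 1 = 11.

```python
import numpy as np

def initial_background(frames):
    """Median over a stack of frames, shape (N, H, W)."""
    return np.median(frames, axis=0)

def update_background(background, frame, alpha=0.05, thresh=25.0):
    """Flag foreground pixels, then update the background by a running average.

    alpha and thresh are hypothetical values, not taken from the report.
    """
    foreground = np.abs(frame - background) > thresh
    background = (1.0 - alpha) * background + alpha * frame
    return background, foreground

def atom_descriptor(foreground, flow_x, flow_y, idle_thresh=0.5):
    """11-D descriptor for one atom; all inputs cover the atom's pixels only."""
    # (1) Persistence: percentage of foreground pixels in the atom.
    persistence = 100.0 * foreground.mean()

    # (2) Direction: 8 direction bins plus an idle bin, un-normalized pixel counts.
    magnitude = np.hypot(flow_x, flow_y)
    angle = np.arctan2(flow_y, flow_x)                      # in [-pi, pi]
    sector = np.floor((angle + np.pi) / (2 * np.pi / 8)).astype(int) % 8
    moving = magnitude >= idle_thresh
    hist = np.zeros(9)
    hist[:8] = np.bincount(sector[moving], minlength=8)
    hist[8] = np.count_nonzero(~moving)

    # (3) Motion magnitude: mean over the atom (single scalar, see assumption above).
    mean_magnitude = magnitude.mean()

    return np.concatenate(([persistence], hist, [mean_magnitude]))
```

Calling `atom_descriptor` on a 10 x 10 x 5 atom returns a vector of length 11 whose nine histogram entries sum to the 500 pixels in the atom.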

3.1.2  Algorithm  for  Video  Anomaly  Detection 

Recall that we are given training video samples and a test video sample. To reduce real-time delay
we break up the test video sample into test video snippets, $\eta^{(1)}, \ldots, \eta^{(m)}$. Our task is to
determine which of the test snippets contain an anomaly. For convenience, we partition the training
video into snippets $x^{(1)}, \ldots, x^{(n)}$, each of the same length as a test snippet $\eta$. Our algorithm
consists of three steps:

(1) Local Scores: For any snippet $y$, which denotes either a test or training snippet, a local
score at spatial location $v$, temporal instant $t$, and spatio-temporal scale $s$ is computed (see
Algorithm 1). We choose an averaging spatio-temporal filter for simplicity in Algorithm 1.

Algorithm 1  Score for $y$ at location $(v,t)$, at scale $s$.

Input: Training descriptors $\{x^{(j)}_{v,\tau}\}$, $\forall j, v, \tau$; $K$ for KNN
Output: $d_{v,t}(s)$

1: Filter at scale $s$: $x^{(j)}_{v,\tau} \leftarrow \mathrm{Filter}_s(x^{(j)}_{v,\tau})$

2: Distance computation: $d_{v,t,\tau,j} = d(y_{v,t}, x^{(j)}_{v,\tau})$, $\forall j, \tau$

3: Compute $d_{v,t,(\ell)}$, the $\ell$th nearest neighbor distance, by sorting $d_{v,t,\tau,j}$.

4: Average: $d_{v,t} \leftarrow$ the average of the sorted distances $d_{v,t,(\ell)}$ over a window of neighbor ranks beginning at $\ell = K+1$

5: Normalize: $d_{v,t} \leftarrow d_{v,t} / D_v$, where $D_v = \max_{t} d_{v,t}$


(2) Snippet Score: Compute the composite score for snippet $y$ from the local scores obtained in
Algorithm 1:

$$d_y(s) = \max_{v,t} d_{v,t}(s)$$

(3) Anomaly Detection: Rank test snippet $\eta$ at scale $s$:

$$R_s(\eta) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{I}\{ d_{x^{(j)}}(s) > d_{\eta}(s) \}$$

Note that our feature descriptors (magnitude, direction, persistence) have different dynamic
ranges. Here we rank separately with respect to the different descriptors. Anomalies are
declared if the rank for any descriptor falls below the desired false alarm threshold. An anomaly
is localized by identifying the spatial and temporal locations in the snippet that contribute
towards achieving the rank, $R_s(\eta)$.
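The three steps can be put together in a short sketch. It is a simplified illustration under several assumptions that are not taken from the report: descriptors are assumed to be already filtered at the chosen scale, the distance is Euclidean, the averaging window covers the K sorted distances starting at rank K+1, training snippets are scored leave-one-out, and the normalizer for each location is the largest training score at that location. The function names are hypothetical.

```python
import numpy as np

def local_scores(snippet, train, K):
    """Local K-NN scores for one snippet (scale filtering assumed done).

    snippet: (V, T, C) atom descriptors; train: (n, V, T, C) training descriptors.
    Needs at least 2 * K training descriptors per spatial location.
    """
    n, V, T, C = train.shape
    scores = np.empty((V, T))
    for v in range(V):
        pool = train[:, v].reshape(-1, C)    # every training descriptor at location v
        for t in range(T):
            dist = np.sort(np.linalg.norm(pool - snippet[v, t], axis=1))
            # Assumption: average K sorted distances starting at rank K+1.
            scores[v, t] = dist[K:2 * K].mean()
    return scores

def rank_snippet(test, train, K):
    """Return (rank, (v, t)): the empirical rank of the test snippet and the
    location of its largest normalized local score, used for localization."""
    n = train.shape[0]
    # Assumption: each training snippet is scored against the other n - 1 snippets.
    train_local = np.stack([
        local_scores(train[j], np.delete(train, j, axis=0), K) for j in range(n)
    ])
    D = train_local.max(axis=(0, 2))          # per-location normalizer
    D = np.where(D > 0, D, 1.0)
    train_composite = (train_local / D[None, :, None]).max(axis=(1, 2))
    test_local = local_scores(test, train, K) / D[:, None]
    rank = float(np.mean(train_composite > test_local.max()))
    v, t = np.unravel_index(np.argmax(test_local), test_local.shape)
    return rank, (int(v), int(t))
```

A snippet is then declared anomalous when `rank` falls below the chosen false alarm level, for example `rank <= 0.05`, and the returned `(v, t)` points at the atom responsible.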

Tuning Parameters: Our algorithm requires only two parameters, namely, K for the KNN
distance computation and the scale s. It turns out that our results are generally robust to a wide range
of K, so this choice is not an issue. In all our simulations we choose K to be about 50. Scale s can be
dealt with in two possible ways: (1) Compute ranks over different scales and declare an anomaly if
the rank at some scale falls below the threshold. This procedure is conservative; nevertheless,
it controls false alarms at the desired level asymptotically. (2) Use context to determine sensible
temporal and spatial scales. This idea has been used before by Cong et al. [15], who choose an
appropriate basis depending on the scenario. We choose small scales if small-scale anomalies
(abandoned or unusual objects) are important and choose larger scales for spatial anomalies
such as U-turns or global change in behavior.

Computational Issues: KNN distance computation is our main bottleneck. It scales linearly
with the number of 3D bricks. To overcome this drawback, recent approaches for computing
approximate nearest neighbors based on locality sensitive hashing (LSH) [4] can be used. While
we do not present results based on LSH here, in our preliminary experiments we have noticed
that it can drastically reduce the computation time (scaling as the fourth root of the number of 3D
bricks) with little loss in performance.
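To make the idea concrete, the sketch below shows one standard LSH construction for Euclidean distance, based on random projections quantized into buckets. It is not necessarily the scheme of [4] or the one used in the preliminary experiments; the class name, the number of tables, the number of projections and the bucket width are all hypothetical choices.

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """Approximate nearest neighbors with random-projection hash tables."""

    def __init__(self, dim, n_tables=8, n_projections=4, width=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_tables, n_projections, dim))
        self.b = rng.uniform(0.0, width, size=(n_tables, n_projections))
        self.width = width
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.points = []

    def _keys(self, x):
        # One bucket key per table: quantized projections of x.
        h = np.floor((self.a @ x + self.b) / self.width).astype(int)
        return [tuple(row) for row in h]

    def add(self, x):
        x = np.asarray(x, dtype=float)
        index = len(self.points)
        self.points.append(x)
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(index)

    def query(self, x, k):
        """Sorted distances to at most k candidates sharing a bucket with x."""
        x = np.asarray(x, dtype=float)
        candidates = set()
        for table, key in zip(self.tables, self._keys(x)):
            candidates.update(table.get(key, []))
        if not candidates:
            return np.array([])
        found = np.array([self.points[i] for i in sorted(candidates)])
        return np.sort(np.linalg.norm(found - x, axis=1))[:k]
```

Only the points that share a bucket with the query are compared exactly, which is where the saving over a full linear scan comes from; the price is that some true neighbors may be missed.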

3.1.3  Experimental  Results 

The UCSD Ped1 dataset [54] contains 34 training clips of nominal patterns and 36 testing clips
of various abnormal events, for example, bicycles, skaters, carts, etc. Each clip has 200 frames
(20 seconds) at a 158 x 238 resolution. The challenge in this dataset is that the scenes are
extremely crowded. To apply our algorithm, we first calculated optical flow and aggregated
it into histogram and magnitude features. We divided the videos into overlapping
spatio-temporal blocks of 30 pixels x 20 pixels x 5 frames (the block size was chosen such that each
block does not contain too many objects which may interfere with one another) and then
applied our algorithm on snippets consisting of 5 frames. We also experimented with larger
snippets and noticed little performance degradation.

Some image results are shown in Figure 3. Our algorithm can detect different types of
anomalies. We compared our method with SRC proposed in [15] and MDT proposed in
[35]. We also compared our method with Social Force, MPPCA, etc. We found [43] that our
method outperforms all the other algorithms. Table 1 presents some evaluation results:
the Equal Error Rate (EER) (ours 16% < 19% [15]) and Area Under Curve (AUC) (ours
92.7% > 86% [15]). From these comparisons, we conclude that our algorithm outperforms
the other state-of-the-art algorithms considered. One additional advantage of our algorithm is that while
providing frame-level results, we can also provide anomaly localization by back-tracing to the
block with the maximum statistic.
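For readers unfamiliar with the two metrics, the sketch below shows one common way to compute AUC and EER from per-frame anomaly scores and ground-truth labels. It is a generic illustration, not the evaluation code behind Table 1: it assumes higher scores mean "more anomalous" (for the rank-based detector one could use one minus the rank), that scores have no ties, and that the EER is read off at the ROC point where the false positive and false negative rates are closest. The function names are hypothetical.

```python
import numpy as np

def roc_curve(scores, labels):
    """ROC points from scores (higher = more anomalous) and 0/1 labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")   # sweep the threshold downwards
    labels = labels[order]
    true_pos = np.cumsum(labels)
    false_pos = np.cumsum(1 - labels)
    tpr = np.concatenate(([0.0], true_pos / true_pos[-1]))
    fpr = np.concatenate(([0.0], false_pos / false_pos[-1]))
    return fpr, tpr

def auc_and_eer(scores, labels):
    """Area under the ROC curve (trapezoid rule) and equal error rate."""
    fpr, tpr = roc_curve(scores, labels)
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    fnr = 1.0 - tpr
    i = int(np.argmin(np.abs(fpr - fnr)))        # closest point to FPR = FNR
    eer = float((fpr[i] + fnr[i]) / 2.0)
    return auc, eer
```

A lower EER and a higher AUC both indicate a better detector, which is the sense in which the figures quoted above are compared.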

3.2  Multi-Camera  Processing 

We have developed algorithms for finding pixel-level correspondences between multiple cameras
that have partially overlapping fields of view. Our problem is motivated by wide-area
surveillance applications. We present algorithms for settings where cameras have significantly
different orientations and zoom levels with respect to the scene. We propose a correspondence
method based on activity features that, unlike photometric features, have certain geometry
independence properties. The proposed method is directed towards general surveillance
scenarios, where prior calibration is not possible. In addition, we seek techniques that require
little processing power and are communication resource aware.

Figure 3: Abnormal event detections for the UCSD Ped1 dataset. Objects such as cars, bicycles,
and skaters are all well detected.

Method        EER    AUC
MPPCA [35]    40%    59%
SF [35]       31%    67.5%
MDT [35]      25%    81.8%
Sparse [15]   19%    86%
Ours          16%    92.7%

Table 1: Quantitative comparison of our algorithm with [15] and [35]. EER is the equal error rate and
AUC is the area under the ROC curve.

Tracking moving objects in video is a difficult problem and has been approached from many
different angles. Some of the most successful techniques are particle filtering [26] and
covariance-matrix techniques [40,53]. The latter techniques have proved robust to size scaling, pose change,
and illumination variations, while at the same time they can be efficiently implemented using the integral
image concept. The main challenges in their adoption will be in identifying covariance features
to use (e.g., luminance and color can be tracked jointly with position, structure, velocity,
etc.) and extending the approach to multiple cameras. While classical multiple-camera object
tracking, dominant in the computer vision literature, hinges on frame-to-frame correspondence
between cameras, we propose to establish such correspondence using only the event-based
representations [19,20]. Dynamic events, unless occluded, leave a unique signature in
camera-acquired views regardless of the projection angle. The figure below demonstrates a number
of correspondence experiments conducted. The task is to determine the mapping between
locations across different cameras that share an overlapping field of view.

While a car traveling on a highway induces similar luminance/color pattern changes on
cameras viewing it from different angles, it also induces similar patterns of dynamic events
(e.g., sequences of idle and busy periods) across views. By seeking to associate dynamic events,
instead of brightness patterns, across different views, we bypass the difficult issues related to
the 3-D geometry of viewing angles and high-bandwidth requirements. We expect the event-based
correspondence to also be helpful in activity recognition on account of multiple sources observing
the same target. A detailed description of this approach has appeared in [21].
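The idea of associating idle/busy sequences across views can be illustrated with a small sketch. It is a stand-in rather than the method of [21]: each location in each camera is assumed to be summarized by a binary busy/idle time series on a common clock, and locations are matched by picking the highest normalized correlation between their series. The function name and the choice of correlation as the similarity measure are assumptions.

```python
import numpy as np

def match_by_activity(activity_a, activity_b):
    """Match locations across two cameras by correlating busy/idle sequences.

    activity_a: (Na, T) array, 1 where a location in camera A is busy, else 0.
    activity_b: (Nb, T) array for camera B, sampled on the same time axis.
    Returns (best_index, best_correlation), each of length Na.
    """
    a = activity_a - activity_a.mean(axis=1, keepdims=True)
    b = activity_b - activity_b.mean(axis=1, keepdims=True)
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Locations that never change get zero correlation with everything.
    a = a / np.where(a_norm > 0, a_norm, 1.0)
    b = b / np.where(b_norm > 0, b_norm, 1.0)
    correlation = a @ b.T                          # (Na, Nb) similarity matrix
    return correlation.argmax(axis=1), correlation.max(axis=1)
```

Because only binary activity sequences are exchanged, the amount of data a camera would need to transmit is far smaller than the raw frames, in line with the bandwidth argument above.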

The prevailing general method is the scale invariant feature transform (SIFT)
based method. The main difficulty is that when the cameras have significantly different
orientations, the SIFT method fails to produce meaningful results. Activity features, however, are
geometrically independent, which makes the proposed method effective for a large
class of surveillance scenarios.

3.3  Search  &  Retrieval 

The problem of exploratory search is motivated by the need to search large video stores
that are produced remotely. We are interested in finding activities that match a wide variety
of queries. This technology is needed now to enable Asymmetric and Irregular
Warfare capability. Specifically, this aspect of the project is aimed at enabling analytics for wide
area imagery. The goal is to enable searching for suspicious activity in cluttered environments
while using low bandwidth and low storage in conducting the search. This capability will allow a
remote user to search terabytes of wide area imagery efficiently by operating on a compressed
data space, such as a hash table, and to pull video that meets the search criteria.

3.3.1  Challenges 

The  main  challenges  that  arise  in  an  Exploratory  Search  system  in  large-scale  surveillance 
videos  are  listed  below: 

1.) Data lifetime: since video is constantly streamed, there is a perpetual renewal of video
data. This calls for a model that can be updated incrementally as video data is made available.
The model must also scale well with the temporal mass of the video.

2.) Unpredictable queries: the nature of queries depends on the field of view of the camera,
the scene itself and the type of events being observed. The system should support queries of
different nature that can retrieve both recurrent events, such as people entering buildings, and
infrequent events, such as cars performing U-turns, cars passing, car dismounts, etc.

3.) Unpredictable event duration: events are unstructured. They start anytime, vary in
length, and overlap with other events. The system is nonetheless expected to return complete
events regardless of their duration and whether or not other events occur simultaneously.

4.) Clutter and occlusions: tracking and tagging objects in urban videos is challenging due
to occlusions and clutter, especially when real-time performance is required.

5.) Airborne videos: airborne videos offer new challenges. They record wide-area imagery
and use mega-pixel cameras. On the other hand, because of their constant motion, registration
is required before the footage can be subjected to video analysis. Imperfect registration is a
common issue, and this requires methods that are robust to imperfections introduced in the
registration process.

Figure 4: Occlusion map using left-right check and the proposed method: (a,d,g,j,m) Camera 1
frames, (b,e,h,k,n) Camera 2 frames, (c,f,i,l,o) Segmentation results for Camera 1 frames: Red
regions appear in Camera 1 frame but not in Camera 2 frame, blue regions are common in
Camera 1 and Camera 2, green regions carry no motion.

Related  Work: 

Image Based Approaches: There has been significant work in the literature on indexing images.
Nevertheless, these approaches run into immediate problems in video search: the size of the
data representation is of the same order as (or larger than) the images themselves. Furthermore,
some traditional approaches are based on image retrieval, which relies on matching a bag of
features. In contrast, video activities have a time component. Therefore, a meaningful match
must not only match at the frame level but also be coherent in time to capture semantically
meaningful spatio-temporal relationships. Most video papers devoted to summarization, search
and retrieval focus on broadcast videos such as music clips, sports games, movies, etc. These
methods typically divide the video into "shots" [18,48,50,51] by locating and annotating key
frames corresponding to scene transitions. The search procedure exploits the key frames' content
and matches either low-level descriptors [50] or higher-level semantic meta-tags to a given
query [65]. Unfortunately, surveillance videos are fundamentally different from conventional
videos. Surveillance videos often contain many unrelated activities and events, and so
they cannot be decomposed into "scenes" separated by key frames that one could
summarize with some meta tags or a global mathematical model. Furthermore, surveillance
videos have no closed-caption or audio track one could rely on [65].

Video Clustering Based Approaches: This motivates approaches that attempt to index the
dynamic content of the video in a way that is compatible with arbitrary upcoming user-defined
queries. In that perspective, most scene-understanding video analytic methods work on a
two-stage procedure: (1) learn patterns of activities via some clustering/learning procedure and
then (2) recognize new patterns of activity via some classification stage. Since activities in
public areas often follow some basic rules (think of traffic lights, highways, building entries,
etc.) the training stage often quantizes space and time into a number of states with transition
probabilities. Common models are HMMs [34,38,41,55,62], Bayesian networks [12,59], context
free grammars [56], and other graphical models [36,49,57]. As for the classification stage, it is
either used to recognize pre-defined patterns of activity [8,10,16,23,28,47,49,60,61] (useful for
counting [16,52]) or to detect anomalies by flagging everything that deviates from what has been
previously learned [5,14,25,39,41,42,45]. We also note that methods working on global behavior
understanding often rely on tracking [12,36,41,55,58] while those devoted to isolated action
recognition rely more on low-level features [10,17,23,47,61]. Although these methods could
probably be tuned to index the video and facilitate search, very few papers explicitly address
this question. One such paper is the one by Wang et al. [57]. Their method decomposes
the video into clips in which the local motion is quantized into words. These words are then
clustered into so-called topics such that each clip is modeled as a distribution over these topics.
Queries being a combination of these topics, their search algorithm fetches every clip containing
all of the topics mentioned in the query. A similar approach can be found in [24,34,59].

However, search techniques focused on global explanations operate at a disadvantage.
First, the preponderance of clutter (challenge 4) in surveillance video makes the training step
of scene understanding prohibitively difficult. Second, since these techniques often focus on
understanding recurrent activities, they are unsuitable for retrieving infrequent events - this
can be a problem, given that queries are unpredictable (challenge 2). Finally, the training
step in scene understanding can be prohibitively expensive, which conflicts with large
data lifetimes (challenge 1).

3.3.2  Search  Algorithm 

The concepts developed here account for the challenges posed by search in large surveillance
videos and overcome some of the drawbacks of traditional approaches. First, we extract a full
set of features, as we have no a priori knowledge of what query will be asked. Unlike scene
understanding techniques, we have no training step; this would be incompatible with the data
lifetimes and magnitudes of the corpus. Instead, we develop an approach based on exploiting
temporal orders on simple features, which allows us to find arbitrary queries quickly while
maintaining low false alarm rates. The results reported here are based on work that has already
resulted in a number of publications [7,13,19,20,30,31,44].

We  tackle  the  aforementioned  challenges  through  the  following  sub-components: 

Efficient Representation: First, a capability for efficient representation and storage of the
data must be developed; preferably, one which is smaller than the video, as even simple
surveillance video can represent terabytes of data. The stored representation must be sufficiently informative
so that activities that could potentially be of interest are preserved in the compressed
representation. In order to address these challenges arising from efficient representation, we propose to
employ simple pixel-level features (Motion, Size, Color, Persistence), and rely on their
spatio-temporal relationships to identify query matches. We propose to divide up incoming video into
space-time cubes (and pyramids of those cubes, called trees), and compute amalgamations of
pixel-level features for each cube and pyramid as illustrated in Fig. 2. Each of these feature
trees is hashed using locality sensitive hashing (LSH) for fast retrieval. This approach yields
a highly efficient representation of the video; a 5-hour, 7 GB video is compressed to a 5 MB
index.
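
The cube-and-tree construction can be sketched as follows. This is a minimal illustration under stated assumptions: the tile size B and clip length A follow the Figure 5 caption, but the use of a plain mean as the "amalgamation", the stride-one grouping of 2 x 2 atoms, and all function names are assumptions made for the example rather than details taken from the report.

```python
# Hypothetical sketch of atoms (B x B tiles aggregated over A frames) and
# two-level trees (a parent summarizing four adjacent atoms).


def atom_features(clip, B):
    """clip: list of A frames, each an H x W nested list holding one
    per-pixel feature value (e.g. a 0/1 motion flag).
    Returns a grid of atoms; each atom is the mean feature value over a
    B x B tile across all frames (the mean is an assumed aggregation)."""
    A = len(clip)
    H, W = len(clip[0]), len(clip[0][0])
    rows, cols = H // B, W // B
    atoms = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for frame in clip:
                for y in range(r * B, (r + 1) * B):
                    for x in range(c * B, (c + 1) * B):
                        total += frame[y][x]
            atoms[r][c] = total / (A * B * B)
    return atoms


def two_level_trees(atoms):
    """Group every adjacent 2 x 2 block of atoms into a tree. Blocks are
    taken with stride one, so neighbouring trees partially overlap."""
    trees = []
    for r in range(len(atoms) - 1):
        for c in range(len(atoms[0]) - 1):
            children = [atoms[r][c], atoms[r][c + 1],
                        atoms[r + 1][c], atoms[r + 1][c + 1]]
            parent = sum(children) / 4.0
            trees.append(((r, c), parent, children))
    return trees


def make_frame(H, W, active):
    """Build a toy frame whose pixels are 1 at the given (y, x) positions."""
    return [[1 if (y, x) in active else 0 for x in range(W)]
            for y in range(H)]


# Toy document: A = 2 frames of 4 x 6 pixels, motion in the top-left tile.
active = {(0, 0), (0, 1), (1, 0), (1, 1)}
clip = [make_frame(4, 6, active) for _ in range(2)]
atoms = atom_features(clip, B=2)
trees = two_level_trees(atoms)
print(atoms)
print(trees)
```

Storing one small feature vector per tree instead of A x B x B raw pixels per atom is what makes an index several orders of magnitude smaller than the video plausible.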

Query  Representation:  A  query  pattern  must  be  generated,  either  directly  or  via  exemplars, 
to  search  through  the  video.  This  query  pattern  must  be  sufficiently  descriptive  to  account  for 
semantically  meaningful  spatio-temporal  relationships.  An  example  of  such  a  query  interface  is 
illustrated  in  Fig.  6. 

Search Algorithm: The search algorithm must efficiently find semantically meaningful matches
to its query pattern in its data representation. To describe the search engine we first present
a block diagram view of the overall system. The main idea is to reduce the problem to the
relevant data, and then reason intelligently over that data. This process is shown in Fig. 7. As
data streams in, video is pre-processed to extract relevant features - activity, object size, color,
persistence and motion. These low-level features are hashed into a fuzzy, light-weight lookup
table by means of LSH [22]. We propose LSH because it can account for spatial variability and
reduces the search space for a user's query to the set of relevant partial (local) matches.

Figure 5: (Left) Given a W x H x F video, documents are non-overlapping video clips each containing
A frames. Each of the frames is divided into tiles of size B x B. Tiles form an atom when aggregated
together over A frames. (Right) Atoms are grouped into two-level trees - every adjacent set of four
atoms is aggregated into a parent, forming a set of partially overlapping trees.
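
The role of the LSH lookup table can be illustrated with a small sketch. The report does not state which LSH family is used, so the random-hyperplane (sign of random projection) scheme below is an assumption chosen for brevity, as are the class name, the feature vector layout and the payload format.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Hypothetical random-hyperplane LSH table: feature vectors pointing
    in similar directions tend to receive the same bit signature and so
    land in the same bucket."""

    def __init__(self, dim, n_bits, seed=0):
        rng = random.Random(seed)
        # One random Gaussian hyperplane per signature bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.table = defaultdict(list)

    def key(self, vec):
        """Bit signature: which side of each hyperplane the vector is on."""
        bits = []
        for plane in self.planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            bits.append(1 if dot >= 0 else 0)
        return tuple(bits)

    def insert(self, vec, payload):
        """Store a payload (e.g. document index and tree position)."""
        self.table[self.key(vec)].append(payload)

    def lookup(self, vec):
        """Return the payloads stored in the query vector's bucket."""
        return list(self.table.get(self.key(vec), []))


# Toy usage: a four-dimensional feature vector (layout is made up).
index = HyperplaneLSH(dim=4, n_bits=8, seed=1)
feature = [0.9, 0.1, 0.0, 0.3]
payload = ("document", 12, (3, 4))
index.insert(feature, payload)

# A positively rescaled copy has the same signature, so it is retrieved.
query = [2 * x for x in feature]
partial_matches = index.lookup(query)
print(partial_matches)
```

A lookup touches only one bucket, so its cost depends on how many stored items share the query's signature and not on the total length of the indexed video.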

Our search engine optimizes over the partial matches to produce full matches: segments of
video which fit the entire query pattern, as opposed to part of it. This optimization operates
from the advantageous standpoint of having only to reason over the partial matches, which are
the relevant subset of the video. In surveillance video, where a long time can pass without
relevant action, this dramatically reduces the workload of the optimization algorithm. We consider
two optimization approaches. The first is a greedy approach which flattens the query in time. The
second, a novel dynamic programming (DP) approach, exploits the causal ordering of the component
actions that make up a query. DP reasons over the set of partial matches and finds the best full match.

In our work we have developed GUIs to describe queries. The analyst enters the number of
action components to find, and then draws the motion patterns for
those actions. In order to recognize the complete set of actions in the video, we first get the
set of matches to each individual action component. Because we have the video components
hashed, this is a very fast lookup - it is linear in the number of matching components.
This is a convenient property, because action is frequently sparse in a video, and so scaling with
the number of matches makes the actual length of the video irrelevant for performance. All
that matters is the amount of action in the video. Once we are given a set of matches for each
action component, the search for a full match can be formulated as a dynamic programming
problem. We employ the Smith-Waterman algorithm, originally developed for genome matching,
to find the ranked set of matches in the video.
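
A minimal sketch of Smith-Waterman local alignment applied to this setting is given below. It assumes, purely for illustration, that each video document has been reduced to a single action label and that the query is an ordered list of such labels; the scoring values and label names are made up, and the sketch returns only the single best match, not a ranked set.

```python
def smith_waterman(query, video, match=2, mismatch=-1, gap=-1):
    """Local alignment of an ordered query against a label sequence.
    Returns (best_score, (start, end)) so that video[start:end] is the
    best-matching segment. Scoring parameters are assumed values."""
    rows, cols = len(query) + 1, len(video) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == video[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align the two labels
                          H[i - 1][j] + gap,     # skip a query component
                          H[i][j - 1] + gap)     # skip a video document
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if query[i - 1] == video[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (j, end)


# Toy query: a vehicle heads east, stops, then heads north.
query = ["east", "stop", "north"]
video = ["idle", "east", "idle", "stop", "north", "idle"]
score, span = smith_waterman(query, video)
print(score, video[span[0]:span[1]])
```

The gap penalty lets the match tolerate an unrelated document between two query components, while the enforced ordering rejects segments in which the components occur in the wrong sequence.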

Our system currently supports a combination of motion and object type queries. In the
follow-on project we would like to extend our approach to support longer-term activities where
the routes or motion attributes could be uncertain. These types of queries could involve activities
corresponding to multiple destinations over large time scales. We are also currently investigating
how to extend our GUI-based querying to represent complex motion patterns. In this context
we are planning on moving out of the realm of manually-created queries into exemplar-based
querying. Sometimes, the identifying structure of an action may be difficult for a user to
decipher, but they could provide a number of examples ("I know it when I see it"). We propose to
break down exemplar videos to produce action components for search.

[Figure 6 panel labels: (1) Select a predefined video section; (2) Draw query; (3) Select target
properties; (4) Select query type; (5) Execute query]

Figure 6: The query creation GUI provides a straightforward way to construct queries. The user
draws each action component (shown in blue), and can additionally specify features.

Task  Frames  Resolution   # queries  Retrievals  False Retrievals  Groundtruth  Time/query
1     500     3850 x 5950  29         33          1                 33           1.73 sec.
2     500     1951 x 5950  11         11          0                 11           0.052 sec.

Table 2: Results for processing the Airborne data. Car queries representing routes as long as 1500
pels were examined. The computer used has an Intel Core i5 @ 2.67 GHz 2.66 GHz and 4.0 GB
RAM.

3.3.3  Results 

Table 2 summarizes the results obtained from processing the Airborne data. We examined
routes as long as 1500 pels. Examples of some of the examined routes are shown in Fig. 8.
Some of those routes undergo strong occlusion and others have many turns. Fig. 9 shows
the ROC for tasks 1 and 2. The Retrieval Score Threshold is the minimum path score
(generated by the algorithm) required in order to declare a path a search result. The points on
Fig. 9 represent 20 different Retrieval Score Threshold values in the range 0:1:20 (MATLAB
notation). As seen from the generated ROC, our technique performs well, achieving a
correct detection rate of 0.85 at a false alarm rate of 0.1.
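
How a threshold sweep produces the points of an ROC can be sketched as follows. The scored paths are made-up toy data, and the definitions used here (detection rate as the fraction of true paths retrieved, false alarm rate as the fraction of false paths retrieved) are assumptions for the example; the report does not spell out its normalization.

```python
def roc_points(scored_paths, thresholds):
    """scored_paths: list of (path_score, is_true_match) pairs.
    For each threshold, a path is declared a search result when its
    score is at least the threshold. Returns a list of
    (threshold, false_alarm_rate, detection_rate) tuples."""
    positives = sum(1 for _, truth in scored_paths if truth)
    negatives = len(scored_paths) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, truth in scored_paths if s >= t and truth)
        fp = sum(1 for s, truth in scored_paths if s >= t and not truth)
        detection = tp / positives if positives else 0.0
        false_alarm = fp / negatives if negatives else 0.0
        points.append((t, false_alarm, detection))
    return points


# Toy candidate paths (made up): three true matches, three false ones.
paths = [(18, True), (15, True), (12, False),
         (9, True), (4, False), (2, False)]

# range(0, 21) mirrors the MATLAB expression 0:1:20.
points = roc_points(paths, range(0, 21))
for t, fa, det in points:
    print(t, round(fa, 2), round(det, 2))
```

Raising the threshold moves the operating point toward the origin of the ROC, trading missed detections for fewer false alarms.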

This approach represents a fundamentally different way of approaching the video search
problem. Rather than relying on an abundance of training data or finely-tuned features to
differentiate actions of interest from noise, we rely on simple features and causality. In addition
to the clear benefits in terms of a run-time which scales sub-linearly with the length of the video
corpus, the simple features and hashing approach render the approach robust to user error
as well as poor-quality video. The results demonstrate clearly that causality and temporal
structure can be powerful tools to reduce false alarms. Another added benefit is how the
algorithm scales with query complexity. Whereas algorithms such as topic modeling or
feature-based matching suffer as queries become more complex due to efforts to characterize the query,
the two-step approach becomes more successful - the more action components in a query, the
more likely it is to differentiate itself from noise. There is, of course, non-temporal structure that
we have yet to exploit. Spatial positioning of queries, such as "The second action component
must occur to the northeast of the first", or "The second action component must be near the
first", is a simple attribute which may further differentiate queries of interest from background
noise. This is not to say that the approach is without limitations. It requires that the
activity being described contain discrete states, each of which is describable by a simple feature
vocabulary. Complex actions like sign language, or actions which are too fast or too small to be
identified at the atom level, will be difficult to search for.

Figure 7: From streaming video low level features for each document are computed and inserted into
a fuzzy, lightweight index. A user inputs a query, and partial matches (features which are close to
parts of the query) are inserted into a dynamic programming (DP) algorithm. The algorithm extracts
the set of video segments which best matches the query.

Figure 8: Examples of the examined routes. Routes are shown in red and they start from point X
and end at point Y. Some of the routes undergo strong occlusion (see blue region, second row, left)
and others undergo many turns (see second row, right).